[iris] LogPusher: background drain thread, self-heal via resolver, never drop on send failures #4866
Conversation
Give LogPusher an optional str->str resolver (e.g. resolves iris://system/log-server via the controller's list_endpoints RPC). Cached RPC client is invalidated on retryable push errors; the next push re-resolves. Worker creates one pusher that survives across lifecycle resets instead of rebuilding on every heartbeat-timeout iteration, which is what let the silent log-mute bug slip through when heartbeats recovered after a controller restart. Fixes #4860
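The resolver-plus-cached-client behavior described above can be sketched as follows. This is an illustrative sketch, not the real iris API: `_make_client`, the resolver signature, and the dict stand-in for an RPC client are all assumptions.

```python
import threading


class LogPusher:
    """Sketch: optional str->str resolver with a cached, invalidatable client."""

    def __init__(self, url, resolver=None):
        self._url = url
        # With no resolver, fall back to the identity function so the
        # caching/invalidation path is identical either way.
        self._resolver = resolver or (lambda u: u)
        self._client = None
        self._lock = threading.Lock()

    def _get_client(self):
        # Resolve lazily and cache; a controller outage only costs a
        # resolve on the first send after invalidation.
        with self._lock:
            if self._client is None:
                resolved = self._resolver(self._url)  # e.g. a controller RPC
                self._client = self._make_client(resolved)
            return self._client

    def _invalidate(self):
        # Called on retryable push errors; the next push re-resolves.
        with self._lock:
            self._client = None

    def _make_client(self, address):
        return {"address": address}  # stand-in for a real RPC client
```

Usage-wise, two consecutive `_get_client()` calls share one resolve; `_invalidate()` forces the next call to resolve again.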
Claude finished @rjpower's task in 5m 24s (View job). Code review: no issues found. Checked for bugs and AGENTS.md compliance.
Notes from the review:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 877c3c957d
```python
self._log_pusher = LogPusher(
    "iris://system/log-server",
    interceptors=log_interceptors,
    resolver=self._resolve_log_service,
```
Keep endpoint resolution off synchronous emit path
Passing resolver=self._resolve_log_service here moves a controller RPC into LogPusher._send, which is invoked inline from push() when a buffer reaches batch_size. Because the worker controller client is created with a 60s default timeout, a controller outage can block the caller thread that emitted the triggering log line for up to a minute. Before this change, endpoint lookup happened in lifecycle attach (not on the log emit path), so noisy task logs could not stall worker/task execution this way.
…attach/rename: Move LogPusher construction out of _attach_log_handler and into start(), alongside the controller client. Attach logic becomes one short branch on (handler is None). The resolver honors the server_url it's given instead of ignoring it.
Rewrite LogPusher with a dedicated drain thread. push() is non-blocking; the drain thread wakes on batch_size, flush(), close(), or flush_interval and sends what's buffered. All send failures re-buffer at the head of the key's deque — retryable and non-retryable alike. MAX_LOG_BUFFER_SIZE (10k) is the only path that drops entries, via oldest-first eviction.
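A minimal sketch of that drain-thread design, assuming a `send` callable in place of the real push_logs RPC. All names here are illustrative, and the real class additionally backs off between failed attempts rather than retrying immediately:

```python
import threading
from collections import deque

MAX_LOG_BUFFER_SIZE = 10_000  # from the PR: overflow is the only drop path


class DrainingPusher:
    """Sketch: non-blocking push(), dedicated drain thread, rebuffer on failure."""

    def __init__(self, send, batch_size=100, flush_interval=1.0):
        self._send = send
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue = deque()
        self._cond = threading.Condition()
        self._closed = False
        self._thread = threading.Thread(target=self._drain_loop, daemon=True)
        self._thread.start()

    def push(self, entry):
        # Non-blocking: enqueue, evicting oldest-first on overflow, and
        # wake the drain thread once a full batch is buffered.
        with self._cond:
            if len(self._queue) >= MAX_LOG_BUFFER_SIZE:
                self._queue.popleft()
            self._queue.append(entry)
            if len(self._queue) >= self._batch_size:
                self._cond.notify()

    def flush(self):
        with self._cond:
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._thread.join(timeout=10)

    def _drain_loop(self):
        while not self._closed:
            with self._cond:
                if not self._queue:
                    # Wakes on batch_size notify, flush(), close(), or timeout.
                    self._cond.wait(timeout=self._flush_interval)
                batch = [self._queue.popleft() for _ in range(len(self._queue))]
            if batch:
                try:
                    self._send(batch)
                except Exception:
                    # Re-buffer the failed batch at the head, original order.
                    with self._cond:
                        self._queue.extendleft(reversed(batch))
        # One best-effort drain pass on close.
        with self._cond:
            batch = list(self._queue)
            self._queue.clear()
        if batch:
            try:
                self._send(batch)
            except Exception:
                pass
```

Note the failure path re-buffers with `extendleft(reversed(batch))` so entries keep their original order at the head of the queue.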
🤖 Specification (required for PRs over ~500 LOC).

Problem: Worker at `lib/iris/src/iris/cluster/worker/worker.py` resolved the log service … Observed in #4860: worker … Second problem: the old LogPusher did sync RPC sends inline from `push()`.

Approach: Single `LogPusher` class, no new public types.

Key code (excerpts elided): the drain loop; send, where every failure rebuffers and only retryable failures invalidate; worker attach.

Tests: `lib/iris/tests/test_remote_log_handler.py` and `lib/iris/tests/cluster/worker/test_worker.py` cover attach-once-key-rename. 84/85 of the relevant tests pass locally (1 unrelated skip). Pre-commit …
Replace the fixed 1s wait between failed send attempts with rigging's ExponentialBackoff (0.5s → 30s, factor 2). A push()-triggered notify would otherwise cut the wait short and turn a steady stream of failures into a hot loop. Backoff resets on any successful batch send.
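The backoff policy can be sketched as below. This is an illustrative stand-in with the parameters from the commit message (0.5s initial, 30s cap, factor 2), not rigging's actual ExponentialBackoff API:

```python
class ExponentialBackoff:
    """Sketch: exponential wait between failed send attempts."""

    def __init__(self, initial=0.5, maximum=30.0, factor=2.0):
        self._initial = initial
        self._maximum = maximum
        self._factor = factor
        self._current = initial

    def next_wait(self):
        # Return the current wait, then grow it toward the cap.
        wait = self._current
        self._current = min(self._current * self._factor, self._maximum)
        return wait

    def reset(self):
        # Called after any successful batch send.
        self._current = self._initial
```

The companion change in the drain loop is that the backoff wait is not cut short by a push()-triggered condvar notify; only close() breaks it early, so a steady failure stream cannot turn into a hot loop.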
Collapse per-key deques into one FIFO of (key, entry) pairs. Trimming on overflow is a single popleft. Rebuffering on send failure is appendleft of the untried tail in original order. The drain thread groups by key just before sending — one push_logs RPC per key, same as before.
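The single-FIFO bookkeeping might look like the following sketch; the function names are hypothetical, and the dict-of-lists grouping stands in for one push_logs RPC per key:

```python
from collections import deque


def drain_batch(queue: deque, max_batch: int):
    """Pop up to max_batch (key, entry) pairs, grouped by key just before
    sending, preserving per-key order."""
    batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
    by_key = {}
    for key, entry in batch:
        by_key.setdefault(key, []).append(entry)
    return batch, by_key


def rebuffer(queue: deque, untried: list):
    # On send failure: appendleft of the untried tail, original order.
    queue.extendleft(reversed(untried))
```

Trimming on overflow in this layout is a single `popleft()` on the shared deque, since the oldest entry across all keys sits at the head.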
Drop the resolver-None branches. When no resolver is provided, default to an identity function; _get_client and _invalidate work the same either way. Static-URL pushers now heal stuck TCP connections by rebuilding the RPC client against the same address on retryable failures.
Set propagate=False on the module logger and install a direct stderr handler. Otherwise our own "send failed" warnings would reach the RemoteLogHandler on the root logger and feed right back into the pusher we serve — a re-entrant amplification loop during failure storms. Also remove the duplicate close check after the backoff wait. The outer retry loop's rebuffer/close branch already handles it; one extra send attempt on close is fine (close() joins with a generous timeout).
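The logger isolation is a few lines of stdlib logging; the logger name here is illustrative:

```python
import logging
import sys

# Module logger for the pusher's own diagnostics. propagate=False keeps
# "send failed" warnings off the root logger, where a remote log handler
# would otherwise feed them back into the pusher during failure storms.
logger = logging.getLogger("iris.log_pusher")
logger.propagate = False
logger.addHandler(logging.StreamHandler(sys.stderr))
```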
Compress the class docstring to a 4-sentence sketch; the implementation detail in the drain-thread bullets was out of place for an API doc. Fold test_retryable_failure_rebuffers_at_head, test_retryable_error_invalidates_and_reresolves, test_non_retryable_error_rebuffers_without_invalidating, test_resolver_raising_is_retried, and test_static_url_pusher_reconnects_on_retryable_failure into a single parameterized test_failures_always_deliver_via_retry with four scenarios. The delivery guarantee is the real claim; per-scenario side effects (invalidate/no-invalidate, rebuild/same-client) stay as branch-specific assertions.
Replace `while True` + mid-loop close returns with `while not self._closed`. The close check after rebuffer is gone; the backoff wait still breaks on close via `_closed`, and the outer loop then exits naturally at the next iteration. Semantics unchanged: one best-effort drain pass on close, then exit.
yonromai left a comment:
Approved with 2 follow-ups called out inline.
LogPusher.close() can still close the client before an in-flight send returns. LogPusher.flush() is now non-blocking, which changes the existing task-cleanup contract around final log delivery.
Validation: uv run --group dev python -m pytest tests/test_remote_log_handler.py tests/cluster/worker/test_worker.py -q (51 passed; 1 existing thread-leak warning in test_stop_preserve_containers_does_not_kill_tasks).
Generated with Codex.
```python
        Non-blocking. For draining on shutdown, use ``close``.
        """
        with self._cond:
            if self._queue:
```
P2: flush() is now explicitly non-blocking, but an existing call site still relies on the old synchronous behavior during task cleanup: TaskAttempt._cleanup() uses self._log_pusher.flush() so buffered logs reach the server before cleanup continues (task_attempt.py#L935-L942).
flush() only notifies the drain thread and returns, so short-lived tasks can finish cleanup with their final logs still queued. Could we preserve blocking flush() semantics, or add a synchronous drain method for the task-cleanup path?
|
@claude address the PR comments and the failing test.
Let's make flush() block; it's fine for close() to be best-effort. Ignore that PR comment.
Claude finished @rjpower's task in 15m 19s (View job). Done.
Adds a per-entry monotonic seq to the queue. `flush(timeout=None)` snapshots _pushed_seq and waits on the condvar until the drain has advanced _processed_seq past it, where "processed" means either sent or overflow-dropped. Dropped entries advance the marker so a flush doesn't wait forever on entries that will never reach the server. Makes task_attempt._cleanup's `self._log_pusher.flush()` actually wait for final logs to ship, restoring the pre-rewrite contract. close() stays best-effort per directive. Also replaces the _wait_for polling in test_failures_always_deliver_via_retry with deterministic flush(timeout=...) synchronization; this removes the xdist-load flake that showed up as assert 2==1 / 6==1 on the seed/seed-then-push race. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
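The seq bookkeeping behind the blocking flush() can be sketched as below. The attribute names mirror the description above, but the class itself is illustrative, isolated from the queue and drain thread:

```python
import threading


class SeqTracker:
    """Sketch: blocking flush() via pushed/processed sequence markers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pushed_seq = 0
        self._processed_seq = 0

    def on_push(self):
        with self._cond:
            self._pushed_seq += 1
            return self._pushed_seq

    def on_processed(self, n=1):
        # "Processed" means sent OR dropped on overflow, so flush() never
        # waits on entries that will never reach the server.
        with self._cond:
            self._processed_seq += n
            self._cond.notify_all()

    def flush(self, timeout=None):
        # Snapshot the high-water mark at call time, then wait until the
        # drain thread has processed everything up to it.
        with self._cond:
            target = self._pushed_seq
            return self._cond.wait_for(
                lambda: self._processed_seq >= target, timeout=timeout
            )
```

With no outstanding entries, flush() returns immediately; with a timeout it returns False if the drain has not caught up in time, which is what makes the test synchronization deterministic.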
Rewrite LogPusher with a dedicated drain thread. push() is non-blocking;
the thread wakes on batch_size, flush, close, or flush_interval and sends
what's buffered. Every send failure rebuffers at the head of the key's
deque and backs off — retryable errors additionally invalidate the cached
RPC client so the next attempt re-resolves. Only MAX_LOG_BUFFER_SIZE (10k)
overflow drops entries, oldest-first. Worker builds one pusher in start()
with a resolver that looks up /system/log-server on the controller.
Fixes #4860